Encoded Natural Language Text

Authors

  • William J. Teahan
  • Khaled M. Alhawiti
Abstract

This paper describes several new universal preprocessing techniques that improve Prediction by Partial Matching (PPM) compression of UTF-8 encoded natural language text. Each method adjusts the alphabet in some way (for example, by expanding or reducing it) before the compression algorithm is applied to the amended text. First, a simple bigraph (two-byte) substitution technique is described that significantly improves compression for many languages encoded in Unicode (25% for Arabic text, 14% for Armenian, 9% for Persian, 15% for Russian, 1% for Chinese text, and over 5% for both English and Welsh text). Second, a new preprocessing technique is investigated that outputs separate vocabulary and symbol streams, which are then encoded separately. This also significantly improves compression for many languages (24% for Arabic text, 30% for Armenian, 32% for Persian and 35% for Russian). Finally, novel preprocessing and postprocessing techniques for lossy and lossless compression of Arabic text are described for the dotted and non-dotted forms of the language.
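The bigraph substitution idea can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the choice of the 64 most frequent byte pairs, and the greedy left-to-right matching are all assumptions made for illustration, and the PPM stage that would compress the resulting symbols is omitted. The sketch only shows how replacing frequent two-byte sequences with single codes from an expanded alphabet (256 and up) shortens the symbol sequence while remaining reversible.

```python
from collections import Counter
from typing import Dict, List, Tuple

Pair = Tuple[int, int]


def build_bigraph_table(data: bytes, max_entries: int = 64) -> Dict[Pair, int]:
    """Assign new codes (256 and up) to the most frequent adjacent byte pairs.

    Hypothetical helper: max_entries=64 is an arbitrary illustrative choice.
    """
    pair_counts = Counter(zip(data, data[1:]))
    frequent = pair_counts.most_common(max_entries)
    return {pair: 256 + i for i, (pair, _count) in enumerate(frequent)}


def substitute_bigraphs(data: bytes, table: Dict[Pair, int]) -> List[int]:
    """Greedily replace byte pairs found in the table with their single code."""
    symbols: List[int] = []
    i = 0
    while i < len(data):
        pair = tuple(data[i:i + 2])  # a 1-tuple at the last byte, never in the table
        if pair in table:
            symbols.append(table[pair])
            i += 2
        else:
            symbols.append(data[i])
            i += 1
    return symbols


def restore_bytes(symbols: List[int], table: Dict[Pair, int]) -> bytes:
    """Invert the substitution: codes >= 256 expand back to their byte pair."""
    inverse = {code: pair for pair, code in table.items()}
    out = bytearray()
    for symbol in symbols:
        if symbol < 256:
            out.append(symbol)
        else:
            out.extend(inverse[symbol])
    return bytes(out)
```

For a short Arabic string such as `"سلام سلام".encode("utf-8")`, where every letter occupies two bytes, the substituted sequence is shorter than the byte sequence and `restore_bytes` recovers the original bytes exactly. A real preprocessor would also have to pass the substitution table to the decoder or fix it in advance.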

Related resources

Natural Language Generation for Text-to-Text Applications Using an Information-Slim Representation

I propose a representation formalism and algorithms to be used in a new language generation mechanism for text-to-text applications. The generation process is driven by both text-specific information encoded via probability distributions over words and phrases derived from the input text, and general language knowledge captured by n-gram and syntactic language models. A Text-to-Text Perspective...

American Sign Language Generation: Multimodal NLG with Multiple Linguistic Channels

Software to translate English text into American Sign Language (ASL) animation can improve information accessibility for the majority of deaf adults with limited English literacy. ASL natural language generation (NLG) is a special form of multimodal NLG that uses multiple linguistic output channels. ASL NLG technology has applications for the generation of gesture animation and other communicat...

Text-Driven Forecasting

Forecasting the future hinges on understanding the present. The web—particularly the social web—now gives us an up-to-the-minute snapshot of the world as it is and as it is perceived by many people, right now, but that snapshot is distributed in a way that is incomprehensible to a human. Much of this data is encoded in text, which is noisy, unstructured, and sparse; yet recent developments in n...

Evidentiality for Text Trustworthiness Detection

Evidentiality is the linguistic representation of the nature of evidence for a statement. In other words, it is the linguistically encoded evidence for the trustworthiness of a statement. In this paper, we aim to explore how linguistically encoded information of evidentiality can contribute to the prediction of trustworthiness in natural language processing (NLP). We propose to incorporate evid...

Towards a Better Understanding of the Language Content in the Semantic Web

Internet content today is about 80% text-based. No matter static or dynamic, the information is encoded and presented as multilingual, unstructured natural language text pages. As the Semantic Web aims at turning Internet into a machine-understandable resource, it becomes important to consider the natural language content and to assess the feasibility and the innovation of the semantic-based ap...

Publication date: 2015